Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bin_df_cols leave input df unchanged #192

Merged
merged 4 commits into from
Aug 7, 2024

Conversation

janosh
Copy link
Owner

@janosh janosh commented Aug 7, 2024

No description provided.

@janosh janosh added enhancement Improvement to existing features/functionality ux User experience labels Aug 7, 2024
@janosh janosh enabled auto-merge (squash) August 7, 2024 20:43
@janosh janosh merged commit 0d8ec52 into main Aug 7, 2024
7 checks passed
@janosh janosh deleted the bin-df-cols-leave-input-df-unchanged branch August 7, 2024 20:43
@@ -21,7 +21,7 @@ def count_elements(
values: ElemValues,
count_mode: ElemCountMode = ElemCountMode.composition,
exclude_elements: Sequence[str] = (),
fill_value: float | None = 0,
fill_value: float | None = None,
Copy link
Collaborator

@DanielYang59 DanielYang59 Oct 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@janosh This seem to cause the behaviour change in #215, is this an accidental change? I think the original default value of 0 is correct here? i.e. using 0 for element this is missing?

fill_value (float | None): Value to fill in for missing elements. Defaults to 0.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for looking into this! 🙏 i think returning None for missing elements in count_elements is more correct/semantic than 0. so ideally, ptable_heatmap_ratio should be updated to work with the new behavior

Copy link
Collaborator

@DanielYang59 DanielYang59 Oct 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think returning None for missing elements in count_elements is more correct/semantic than 0

I guess both choices hold merit from certain standpoint, i.e. None is more semantic to suggest that particular element is missing, 0 is also valid as its count is indeed zero.

For the particular function, however, I slight prefer using zero because it seems to be designed to "count element occurrence":

"""Count element occurrence in list of formula strings or dict-like compositions.
If passed values are already a map from element symbol to counts, ensure the
data is a pd.Series filled with zero values for missing element symbols.

So its occurrence would be zero if that particular element doesn't appear in the data.


ideally, ptable_heatmap_ratio should be updated to work with the new behavior

I'm afraid this would have broader impact as there're quite a few usage that need to be checked:
image

For example:

from pymatviz import count_elements
from pymatviz.enums import ElemCountMode, Key
from matminer.datasets import load_dataset


dataset = load_dataset("matbench_expt_gap")[Key.composition]
counted_dataset = count_elements(dataset, ElemCountMode.composition, fill_value=0)
print(counted_dataset)

With fill value zero:

symbol
H      235.00
He       0.00
Li     967.66
Be     121.00
B     1168.00
       ...   
Fl       0.00
Mc       0.00
Lv       0.00
Ts       0.00
Og       0.00

With None:

symbol
H      235.00
He        NaN
Li     967.66
Be     121.00
B     1168.00
       ...   
Fl        NaN
Mc        NaN
Lv        NaN
Ts        NaN
Og        NaN

Using None seem to require an additional process step .

So do we want to change this completely or revert this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvement to existing features/functionality ux User experience
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants